Notes:
library(ggplot2)
data("diamonds")
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 1/10, color = "blue") +
stat_smooth(method = "lm", color = "red") +
scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Response: I see that price can vary greatly for any given carat. I see that points tend to fall along a given values of carat that are common for normal people to relate to, like whole numbers, half, and quarter values. I also see that price tends to increase as carat increases generally. The relationship is nonlinear. It might be exponential. As carat increases, the variation around price increases. We modeled a linear layer to the graph, but we can see that it doesn’t follow the relationship as well as it should.
Notes: One of the most successful marketing campaigns ever. After the depression, the de Beers cartel targeted the United States. They wanted to convey the sentiment.
Notes:
Notes:
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things you notice in the ggpairs output? Response: Carat is most highly correlated with price. Dimensions X, Y, and Z are all highly correlated with each other. Price and X, Y, and Z are all highly correlated, but that makes sense since carat is a measure of weight, which is simply a result of X * Y * Z. I see that ideal cut diamonds are the most numerous, and color G is the most common color. I see a right skew in the price data. Price seems to follow a log x scale when price is on the x axis. When price is on the y axis, it follows an exponential curve.
Notes:
library(gridExtra)
plot1 <- ggplot(aes(x = price, fill = I("blue")), data = diamonds) +
geom_histogram(bins = 100) +
ggtitle('Price')
plot2 <- ggplot(aes(x = price, fill = I("orange")), data = diamonds) +
geom_histogram(bins = 50) +
scale_x_log10() +
ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol=2)
Notes: It looks like there is a bimodal distribution for demand in diamonds. One distribution of people is in the market for diamonds around a certain price point. Another distribution of people want diamonds of higher price, and are willing to pay more for them.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
head(sort(table(diamonds$carat), decreasing = TRUE))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = TRUE))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 1/2, size = 3/4, position = "jitter") +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Notes:
Alter the code below.
# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Response: It looks like clarity does explain some of the change in price. Clearer diamonds seem to be higher in price than diamonds of less clarity for any given carat.
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response: It looks like cut accounts for some of the variance in price, but not as much as clarity. There are more diamonds of Ideal cut than other types.
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color',
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response: Color does seem to influence the price of a diamond. We see clearer bands of colors, where a higher quality colored diamond fetches a higher price for any given carat.
Notes:
Response: We would make a linear model that predicts the log() of price for any cubed root of carat weight (^1/3).
Notes:
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## =========================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 1.0 1.0
## adj. R-squared 0.9 0.9 0.9 1.0 1.0
## sigma 0.3 0.3 0.3 0.2 0.1
## F 652012.1 387489.4 138654.5 87959.5 173791.1
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -7962.5 -3631.3 -1837.4 4235.2 34091.3
## Deviance 4242.8 3613.4 3380.8 2699.2 892.2
## AIC 15931.0 7270.6 3690.8 -8442.5 -68140.5
## BIC 15957.7 7306.2 3762.0 -8317.9 -67953.7
## N 53940 53940 53940 53940 53940
## =========================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes:
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response: This model is a little bit hard to understand when it is predicting the ln() of price, rather than price itself. It also assumes that a diamond of zero weight would already cost 0.415 dollars, which isn’t true. The data is from 2008 also, which means we might want to be accounting for inflation. The market today is different from the price of diamonds 8 years ago. Prices had plummeted in 2008, so this model uses diamonds at the floor of the market. Since 2008, price growth has been uneven across different diamond carat classes.
Notes:
#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
head(diamondsbig)
## carat cut color clarity table depth cert measurements price
## 1 0.25 V.Good K I1 59 63.7 GIA 3.96 x 3.95 x 2.52 NA
## 2 0.23 Good G I1 61 58.1 GIA 4.00 x 4.05 x 2.30 NA
## 3 0.34 Good J I2 58 58.7 GIA 4.56 x 4.53 x 2.67 NA
## 4 0.21 V.Good D I1 60 60.6 GIA 3.80 x 3.82 x 2.31 NA
## 5 0.31 V.Good K I1 59 62.2 EGL 4.35 x 4.26 x 2.68 NA
## 6 0.20 Good G SI2 60 64.4 GIA 3.74 x 3.67 x 2.38 NA
## x y z
## 1 3.96 3.95 2.52
## 2 4.00 4.05 2.30
## 3 4.56 4.53 2.67
## 4 3.80 3.82 2.31
## 5 4.35 4.26 2.68
## 6 3.74 3.67 2.38
diamondsbig$logprice = log(diamondsbig$price)
bigdiamondsamples = diamondsbig[sample(1:length(diamondsbig$price),10000),]
model1 = lm(logprice ~ I(carat^(1/3)),
data = bigdiamondsamples[bigdiamondsamples$price < 10000 & bigdiamondsamples$cert == "GIA", ])
model2 <- update(m1, ~ . + carat)
model3 <- update(m1, ~ . + cut)
model4 <- update(m1, ~ . + color)
model5 <- update(m1, ~ . + clarity)
mtable(model1, model2, model3, model4, model5)
##
## Calls:
## model1: lm(formula = logprice ~ I(carat^(1/3)), data = bigdiamondsamples[bigdiamondsamples$price <
## 10000 & bigdiamondsamples$cert == "GIA", ])
## model2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## model3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + cut, data = diamonds)
## model4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + color, data = diamonds)
## model5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + clarity, data = diamonds)
##
## =========================================================================
## model1 model2 model3 model4 model5
## -------------------------------------------------------------------------
## (Intercept) 2.687*** 1.039*** 2.703*** 2.586*** 2.441***
## (0.024) (0.019) (0.007) (0.006) (0.005)
## I(carat^(1/3)) 5.819*** 8.568*** 5.622*** 5.767*** 5.965***
## (0.028) (0.032) (0.007) (0.006) (0.006)
## carat -1.137***
## (0.012)
## cut: .L 0.208***
## (0.005)
## cut: .Q -0.062***
## (0.004)
## cut: .C 0.060***
## (0.004)
## cut: ^4 0.018***
## (0.003)
## color: .L -0.415***
## (0.004)
## color: .Q -0.150***
## (0.003)
## color: .C 0.001
## (0.003)
## color: ^4 0.035***
## (0.003)
## color: ^5 -0.021***
## (0.003)
## color: ^6 -0.026***
## (0.003)
## clarity: .L 0.871***
## (0.006)
## clarity: .Q -0.308***
## (0.005)
## clarity: .C 0.175***
## (0.005)
## clarity: ^4 -0.091***
## (0.004)
## clarity: ^5 0.045***
## (0.003)
## clarity: ^6 0.005
## (0.003)
## clarity: ^7 0.045***
## (0.002)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 0.9 1.0
## adj. R-squared 0.9 0.9 0.9 0.9 1.0
## sigma 0.3 0.3 0.3 0.3 0.2
## F 44130.8 387489.4 137583.5 118994.5 137535.6
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1096.9 -3631.3 -6622.7 -1805.5 5300.3
## Deviance 489.1 3613.4 4037.2 3376.8 2594.7
## AIC 2199.8 7270.6 13259.3 3628.9 -10580.6
## BIC 2219.7 7306.2 13321.6 3709.0 -10491.7
## N 5675 53940 53940 53940 53940
## =========================================================================
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = as.factor("Very Good"),
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
# This gives us the model estimate for the log of price, but not the actual price. To get that, we need to use exp.
modelEstimate
## fit lwr upr
## 1 8.420202 8.16803 8.672375
# This gives us actual price.
exp(modelEstimate)
## fit lwr upr
## 1 4537.822 3526.389 5839.352
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
The model gives us an estimate with a 95% Confidence interval for where the price could land for a diamond of these particular qualities.
Notes:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!